UIUCDCS-R-80-1042 



UILU-ENG  80  1742 


CONSTRUCTION  OF  A  FAULT-TOLERANT  REAL-TIME  SOFTWARE  SYSTEM 

by 

Anthony  Y.  Wei 
Roy  H.  Campbell 


1981 


December  1980 


DEPARTMENT  OF  COMPUTER  SCIENCE 

UNIVERSITY  OF  ILLINOIS  AT  URBANA-CHAMPAIGN 

URBANA,  ILLINOIS  61801 


Research  supported  in  part  by  NASA  Project  NSG  1471 


ABSTRACT 

This paper presents an approach to the construction of a fault-tolerant
real-time software system with high reliability. Top-down modular design
and program development by stepwise refinement are used for the
construction of fault-tolerant systems. The recovery block scheme, deadline
mechanism, and a majority voting unit are employed to achieve the desired
reliability for a real-time process, a segment or refinement. Criteria
identifying the appropriate fault-tolerant schemes for a segment or
refinement are described. A mathematical model estimates the effectiveness
of this approach using a linear approximation technique. The model supports
the application of modular design and stepwise refinement to fault-tolerance
in construction of reliable real-time software.


1   Introduction.

Reliable  software  has  traditionally  been  obtained  through  fault- 
avoidance  techniques.  Efforts  to  achieve  this  goal  include  top-down  modular 
design  [1],  structured  programming  [2],  walkthroughs,  testing  techniques  [3], 
design  of  proper  programming  tools,  and  program  correctness  proofs  [4]. 
Despite  these  approaches,  software  systems  cannot  be  guaranteed  to  be  fault- 
free.  Fault-tolerance  [5,  6]  is  proposed  to  complement  fault-avoidance  and 
further improve the reliability of software systems. Fault-tolerance does not
eliminate the need for reliable components; fault-avoidance techniques are
still required when building fault-tolerance into reliable systems.

A  software  system  can  provide  continuous  and  trustworthy  service  by 
tolerating  certain  faults  and  detecting  and  recovering  from  the  errors.  Such 
activity  is  particularly  valuable  in  real-time  systems  such  as  air  traffic 
control  systems  where  the  delays  caused  by  the  interruption  are  unacceptable. 

This  paper  presents  an  approach  to  the  construction  of  a  fault- 
tolerant  real-time  software  system  with  high  reliability.  Top-down  modular 
design  and  program  development  by  stepwise  refinement  [7]  are  used  for  the 
construction  of  fault-tolerant  systems.  The  recovery  block  scheme  [5,  6], 
deadline  mechanism  [8],  and  a  majority  voting  unit  are  employed  to  achieve  the 
desired reliability for a real-time process, a segment or refinement.
Criteria identifying the appropriate fault-tolerant schemes for a refinement
or module are described. A mathematical model estimates the effectiveness of
this approach using a linear approximation technique. The model supports the
application of modular design and stepwise refinement to fault-tolerance in
construction of reliable real-time software.

The recovery block scheme and the deadline mechanism are briefly
described in section 2. An effort to incorporate these recovery mechanisms
into an existing language as a tool for fault-tolerant real-time programming
is introduced in section 3. In section 4 the construction of a fault-tolerant
real-time software system using the tools provided is presented. A simplified
mathematical model and an analysis of the effectiveness of the approach are
also discussed in section 4.


2   Error Recovery Mechanisms in Real-Time Systems.

Graceful degradation is a desirable attribute of a real-time system.
The system should not simply halt in the event of an internal or external
error. Two types of errors in real-time software systems are considered in
this paper: logical errors and timing errors. Logical errors include residual
design errors and run-time errors not detected during testing. Timing errors
occur only in real-time systems, where timely execution of tasks is required,
and they matter most in time-critical systems. A timing error occurs if a
real-time system violates the timing constraints specified for that system.
The recovery block scheme [5, 6] has been proposed to handle logical errors,
and the deadline mechanism [8] is a solution to the timing error problem.


2.1   Recovery Blocks.

The  recovery  block  scheme  [5,  6]  has  been  proposed  as  a  mechanism  to 
support  fault-tolerant  software.  A  recovery  block  consists  of  an  acceptance 
test,  a  primary  and  a  collection  of  alternates.  The  acceptance  test  is  an 
assertion  about  the  results  computed  by  the  primary  or  the  alternates.  If  the 
acceptance test is false after the primary execution, an alternate is
executed. Errors detected implicitly by the hardware or software during the
execution of the primary or an alternate automatically set the acceptance test to
false.  If  all  alternates  of  a  recovery  block  fail  to  satisfy  the  acceptance 
test,  a  software  error  is  raised. 

The  idea  of  providing  many  alternates  is  very  similar  to  providing 
stand-by  sparing  circuits  in  hardware.  However,  the  redundancy  required  in 
software  is  not  merely  the  replication  of  a  program  but  redundancy  of  design. 
The  alternates  may  perform  the  same  function  as  the  primary  but  use  different 
algorithms. During the execution of the primary or an alternate, the hardware
automatically saves the "old" contents of modified global variables in a
recovery cache [9]. If the primary or an alternate fails, the recovery
mechanism restores those old contents from the recovery cache and returns the
program to a recovery point, which is the state before entering the recovery
block. Another alternate, if any, is then initiated.

Recovery blocks provide alternative algorithms to cope with general
logical errors but cannot handle timing errors properly. Recovery blocks are
insensitive to the passage of time and cannot be used to ensure that a
response is generated by a specific deadline. (That is, it is impossible to
recover from a missed deadline by resetting the system clock and executing an
alternate.) Another mechanism is needed to handle timing errors in real-time
systems.


2.2   Deadline Mechanism.

The  deadline  mechanism,  although  similar  to  the  recovery  block  scheme, 
performs  a  different  function.  The  deadline  mechanism  includes  scheduling 
mechanisms  to  ensure  timely  responses. 

The  deadline  mechanism  associates  two  algorithms  with  each  particular 
service  component.  The  primary  algorithm  produces  a  better  quality  service 
than  the  alternate.  The  alternate  is  a  simpler,  deterministic  algorithm  which 
always  produces  an  acceptable  result  in  a  known,  fixed  length  of  time.  The 
deadline  mechanism  provides  fault-tolerance  in  a  real-time  system  by  ensuring 
that either the primary or the alternate will complete before the deadline. A
set of simulations demonstrating the feasibility of fault-tolerance in
real-time systems is described in [8].

Reliable scheduling of primaries or alternates to meet real-time
constraints requires the calculation of the execution time of each algorithm.
Accurate determination of execution times is critical to system performance
and reliability. Since upper bounds on the execution times for the alternate
algorithms are assumed, the deadline mechanism has been designed so that the
scheduler reserves time for execution of the alternate as requests arrive.
Primaries are scheduled in any remaining time, which is called slack time.


3   Programming Tools for Real-Time Systems.

Reliability can be improved by the development of high level
programming languages to support real-time systems [10]. HAL/S [11] has been
used to program the Space Shuttle control system. Modula [12], Concurrent
Pascal [13], and Path Pascal [14, 15] are all efforts in this direction. The
Ada [16] language may also be used for real-time programming but is as yet
unavailable. They all offer sequential features that support reliable coding,
such as modularity, data abstraction, and control structures. Concurrent
features (called tasks or processes) are also provided in these high level
languages to facilitate systems programming. A process or task is an
independent execution sequence. Processes or tasks can interact and are
coordinated by performing operations on shared variables through some high
level synchronization mechanisms (e.g., Monitors [17] and Path Expressions
[18]). These languages also permit I/O device programming using hardware
dependent features. However, few include fault-tolerant mechanisms. Here, we
briefly describe our efforts to incorporate the error recovery mechanisms
mentioned in the previous section into the Path Pascal language.

3.1   Recovery Blocks in Path Pascal.

The syntax of recovery blocks [6] as incorporated in Path Pascal is
shown below:

    ensure  <acceptance test>
    by      <procedure name>   "primary"
    else by <procedure name>   "1st alternate"
       ...
    else by <procedure name>   "nth alternate"
    else error;


The acceptance test is a Boolean expression which is evaluated after
the execution of the primary. If the result is true, the statement following
the recovery block is executed. However, should the result be false, the
state of the computation is restored to that at entry to the recovery block
and the first alternate is tried, and so on. If all the alternates fail to
produce acceptable results, then this is regarded as a failure of the entire
recovery block. Since recovery blocks may be nested, recovery actions for a
failed recovery block must be undertaken by an enclosing recovery block, if
any.
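
The control flow just described can be sketched outside Path Pascal. The following Python fragment is only an illustration under stated assumptions: the function `ensure`, the exception `RecoveryBlockError`, the dictionary standing in for global variables, and the two sorting routines are all hypothetical names introduced here, and the state is saved by copying it wholesale rather than through a recovery cache.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when the primary and every alternate fail the acceptance test."""


def ensure(acceptance_test, state, *procedures):
    """Try each procedure in turn until one leaves `state` passing the test.

    `state` is a dict standing in for the global variables. It is restored to
    its contents at block entry before the next alternate is tried.
    """
    checkpoint = copy.deepcopy(state)            # recovery point at block entry
    for procedure in procedures:
        try:
            procedure(state)
            if acceptance_test(state):
                return procedure.__name__        # which routine was accepted
        except Exception:
            pass                                 # implicit errors count as a failed test
        state.clear()
        state.update(copy.deepcopy(checkpoint))  # roll back before the next try
    raise RecoveryBlockError("all alternates failed")


def fast_sort(state):
    # Primary: deliberately faulty for the demonstration (drops one element).
    state["data"] = sorted(state["data"])[1:]


def simple_sort(state):
    # Alternate: an independent, simpler design for the same function.
    state["data"] = sorted(state["data"])


def sorted_and_complete(state):
    d = state["data"]
    return len(d) == state["n"] and all(a <= b for a, b in zip(d, d[1:]))


shared = {"data": [3, 1, 2], "n": 3}
used = ensure(sorted_and_complete, shared, fast_sort, simple_sort)
```

Because a failed block raises an exception, an enclosing call to `ensure` would treat it like any other implicit error and move on to its own next alternate, which mirrors the nesting rule above.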

The  recovery  cache  mechanism  has  been  implemented  in  the  Path  Pascal 
run-time  system.  A  recovery  cache  is  basically  organized  as  a  stack  and  is 
used  to  store  the  address  and  the  old  content  of  global  variables  modified 
during  the  execution  of  a  recovery  block.  Upon  a  failure  of  the  acceptance 
test,  the  modified  global  variables  will  be  restored  to  the  state  prior  to 
entering  the  recovery  block.  Each  process  has  its  own  recovery  cache.  The 
storage  for  recovery  cache  is  allocated  when  a  process  is  instantiated. 
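
A stack-organized cache of this kind can be modelled in a few lines. The sketch below is an assumption-laden illustration, not the Path Pascal run-time code: the class `RecoveryCache` and its method names are hypothetical, variable names stand in for addresses, and one stack frame is kept per active recovery block.

```python
class RecoveryCache:
    """Stack of frames; each frame maps a variable name to its old content."""

    _MISSING = object()   # marks a variable that did not exist at block entry

    def __init__(self, variables):
        self.variables = variables   # dict standing in for global storage
        self.frames = []             # one frame per nested recovery block

    def enter_block(self):
        self.frames.append({})

    def write(self, name, value):
        if self.frames and name not in self.frames[-1]:
            # Save the old content only on the first write inside this block.
            self.frames[-1][name] = self.variables.get(name, self._MISSING)
        self.variables[name] = value

    def restore(self):
        """Acceptance test failed: undo every write made in the current block."""
        frame = self.frames[-1]
        for name, old in frame.items():
            if old is self._MISSING:
                del self.variables[name]
            else:
                self.variables[name] = old
        frame.clear()

    def exit_block(self):
        """Acceptance test passed: drop the frame, keeping what an outer block needs."""
        inner = self.frames.pop()
        if self.frames:
            for name, old in inner.items():
                self.frames[-1].setdefault(name, old)


cache = RecoveryCache({"x": 1, "y": 2})
cache.enter_block()
cache.write("x", 10)
cache.write("x", 11)      # second write: the saved old value stays 1
cache.write("z", 5)
cache.restore()           # primary rejected
after_restore = dict(cache.variables)
cache.write("y", 20)      # alternate runs
cache.exit_block()        # alternate accepted
after_commit = dict(cache.variables)
```

Saving a value only on the first write keeps the cache proportional to the number of distinct variables modified, not to the number of assignments.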

The domino effect, first described in [6], is an uncontrolled
propagation of state restoration among interacting processes. A domino-free
system can be achieved by an approach similar to the "programmer-transparent
coordination of recovering parallel processes" presented in [19]. By
introducing some additional recovery points into the process that receives
messages from other processes, a domino effect can be eliminated.


3.2   Deadline  Mechanism  in  Path  Pascal. 

Path  Pascal  has  been  extended  to  include  the  fault-tolerant  deadline 
mechanism.  The  extensions  should  not  be  considered  as  a  language  design, 
rather  as  experimental  features  which  have  allowed  the  programming  of  example 
reliable  real-time  systems. 

The  language  construct  for  the  fault-tolerant  deadline  mechanism 
incorporated  into  Path  Pascal  is  termed  a  deadline  process.  Each  deadline 
process  provides  fault-tolerant  deadline  service.  The  repeat  statement  in 
Path  Pascal  has  been  augmented  to  have  an  optional  phrase  every  to  specify  the 
request  period  (time  between  successive  requests)  for  periodic  processes.  A 
within  statement  must  be  included  in  a  deadline  process  to  specify  the 
response  period  (time  within  which  the  system  must  respond  to  a  request). 
Within  the  deadline  process  one  service  statement  appears,  which  specifies  the 
primary  and  alternate  algorithms  that  are  to  be  used  to  satisfy  the  request. 

The syntax and semantics of the deadline mechanism in Path Pascal can
best be explained by a simple example as follows:

    deadline process positionupdate;

      procedure compute;
      begin
        (* primary algorithm *)
      end;

      procedure approximate;
      begin
        (* alternative algorithm *)
      end;

    begin (* positionupdate *)
      repeat every twoseconds
        within onesecond do
          service compute
          else approximate;
      until navigation_terminated;
    end;


In the example above, compute is a primary and approximate an
alternate to do the position update. Deadline process "positionupdate"
receives requests at a rate of one every two seconds. The deadline process
must produce a response within one second of each request. The execution of
the service statement sets a timer which is used to detect timing errors in
the primary. The timer allows sufficient time for the processor to execute
the alternate. The results from the primary will be preferred to the
alternate's if the primary completes successfully.
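
The decision made by the service statement can be illustrated with simulated execution times. This Python sketch makes several assumptions that are not in the Path Pascal implementation: the function `service`, the convention that each routine returns a `(result, time_used)` pair, and all numeric values are hypothetical, and no real preemption or clock is involved.

```python
def service(primary, alternate, response_period, alternate_bound):
    """Simulate one request of a deadline process.

    `primary` and `alternate` return (result, time_used) in simulated seconds.
    The timer leaves `alternate_bound` seconds free before the deadline so the
    alternate can still run if the primary overruns or fails.
    """
    primary_budget = response_period - alternate_bound
    try:
        result, used = primary()
        if used <= primary_budget:
            return result, "primary"       # primary finished before the timer
    except Exception:
        pass                               # a logical error is handled like a timeout
    result, used = alternate()
    if used > alternate_bound:
        raise RuntimeError("alternate exceeded its assumed upper bound")
    return result, "alternate"


def compute():
    return 42.0, 0.70          # accurate result, finishes in time


def slow_compute():
    return 42.0, 0.90          # accurate result, but overruns the timer


def faulty_compute():
    raise ZeroDivisionError("simulated logical error")


def approximate():
    return 41.5, 0.20          # cruder result within a known, fixed time


on_time = service(compute, approximate, 1.0, 0.25)
late = service(slow_compute, approximate, 1.0, 0.25)
crashed = service(faulty_compute, approximate, 1.0, 0.25)
```

In a real system the primary would be preempted when the timer expires; the simulation only compares the reported time with the budget after the fact.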

The deadline of a process is calculated from the response period when
a service statement is executed. A request is generated by the system clock
for periodic processes based on their request periods. Non-periodic processes
use the minimum value among their request periods as the specification for
all requests; for them the "every" phrase in the repeat statement is replaced
by "at least". Synchronization primitives are used to generate a request for
non-periodic processes. If a request arrives too soon, it is delayed until
the correct request time calculated according to the request period.

Alternates are scheduled according to the rate-monotonic priority
assignment [20] using the response period as a static priority, with small
values having high priority. At any given instant the alternate with the
highest priority (smallest response period) executes. This scheme requires
preemption of alternates. Such a system is timely if processor utilization is
less than ln 2 (approx. 0.69) [20]. The primaries are scheduled in
earliest-deadline-first order. Within a request period, the alternate is
executed first. This is called the first-chance scheduling algorithm. Another
optimal scheduling algorithm, which maximizes the number of primaries
executed for a set of periodic processes, is developed and reported in [21].
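
The utilization condition quoted from [20] is easy to check numerically. The sketch below assumes the classical setting in which each alternate is described by a worst-case execution time and a period; the function names and the task figures are hypothetical and chosen only for illustration.

```python
import math


def utilization(tasks):
    """tasks: iterable of (execution_time, period) pairs for the alternates."""
    return sum(c / t for c, t in tasks)


def liu_layland_bound(m):
    """Utilization bound for m tasks under rate-monotonic priorities."""
    return m * (2 ** (1 / m) - 1)


def rate_monotonic_order(tasks):
    """Highest priority first: the task with the shortest period leads."""
    return sorted(tasks, key=lambda task: task[1])


# Hypothetical alternates: (worst-case execution time, period) in seconds.
alternates = [(0.1, 1.0), (0.02, 0.125), (0.05, 0.5), (0.02, 0.125), (0.1, 1.0)]

u = utilization(alternates)
ordered = rate_monotonic_order(alternates)
timely = u < math.log(2)     # the sufficient condition quoted in the text
```

The bound `m * (2 ** (1 / m) - 1)` decreases toward ln 2 as the number of tasks grows, which is why ln 2 serves as a safe limit for any number of processes.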

The recovery caches of the recovery block are also used for the
deadline mechanism and can be shared. Recovery cache algorithms have been
modified to hold results from the alternate until the primary either fails or
completes successfully in the deadline mechanism.

A compiler option also provides an estimate of the execution time for
compiled functions, procedures and processes. The compiler checks the
execution time for the code included in the "within" statement and verifies
that it is less than the corresponding response period. A global verification
of all deadline processes meeting deadlines is also made by the compiler.

A satellite on-board computer system simulation using deadlines in
Path Pascal has been reported in [22]. The results show the potential of the
deadline mechanism to describe a time-critical real-time system.


4   Top-Down Modular Design.

In  this  section,  a  top-down  modular  design  of  a  real-time  system  is 
presented.  A  mathematical  model  for  periodic  real-time  systems  estimates  the 
effectiveness  of  the  approach. 


4.1   Concurrent  Processes. 

Real-time programs often monitor and control several external
processes. This leads to decomposition of a real-time system into a disjoint
set of processes, which may be expressed as operating in parallel. The
description of a real-time system as distinct concurrent processes may be
simpler than as a single integrated sequential task. Concurrent features in
Path Pascal support such decomposition. Furthermore, timing constraints are
assumed to be imposed upon each process. A timing constraint specifies how
often the process should run (request period) and how soon the process should
respond (response period). The augmented options in the "repeat" statement
and the "within" statement in Path Pascal can be used to achieve this goal.

For simplicity of analysis, it is assumed in this paper that a
real-time software system can be expressed by a set of periodic processes and
each process's request period is a multiple of the next smallest request
period. To ensure the timely responses of the whole real-time system, the
deadline mechanism is applied to each periodic process. The "service"
statement embedded in a "deadline process" in Path Pascal can be employed to
perform this function.

Now suppose the underlying real-time software system consists of $M$
($> 0$) periodic processes with request periods $t_1, t_2, \ldots, t_M$,
respectively. Assume $t_1 \le t_2 \le \cdots \le t_M$ and $t_{i+1}$ is a
multiple of $t_i$ for $i = 1, 2, \ldots, M-1$. Let

$$ n_i = \frac{t_M}{t_i}, \qquad i = 1, 2, \ldots, M $$

and suppose $f_i$ is the failure probability of process $i$ during one
request period $t_i$. Note that the failure here includes the failures caused
by logical errors or timing errors. The failure probability of the total
system during one request period $t_M$, denoted by $F$, can be approximated
as follows.

During the period $t_M$, process $i$ will be executed $n_i$ times. The
reliability of process $i$ during $t_M$ will be $(1-f_i)^{n_i}$. For the
system consisting of $M$ processes, the reliability during $t_M$ is then
$\prod_{i=1}^{M} (1-f_i)^{n_i}$. So the failure probability $F$ will be

$$
\begin{aligned}
F &= 1 - \prod_{i=1}^{M} (1-f_i)^{n_i} \\
  &= 1 - (1 - n_1 f_1 + \cdots)(1 - n_2 f_2 + \cdots) \cdots (1 - n_M f_M + \cdots) \\
  &= n_1 f_1 + n_2 f_2 + \cdots + n_M f_M \pm \text{some higher order terms}
\end{aligned}
$$

Since $f_i$ is very small (i.e., $0 < f_i \ll 1$), the higher order terms can
be ignored. Then we have

$$ F \approx \sum_{i=1}^{M} n_i f_i \qquad (1) $$
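
The quality of the linear approximation in (1) can be checked numerically. In the sketch below the frequency counts are those of the five-process example analysed in section 4.3, while the per-process failure probability of 1e-6 is a hypothetical value chosen only to exercise the formula; the function names are likewise introduced here.

```python
import math


def exact_failure(n, f):
    """F = 1 - prod((1 - f_i) ** n_i), the expression before linearization."""
    return 1.0 - math.prod((1.0 - fi) ** ni for ni, fi in zip(n, f))


def linear_failure(n, f):
    """Equation (1): F is approximately the sum of n_i * f_i."""
    return sum(ni * fi for ni, fi in zip(n, f))


n = [8, 8, 2, 1, 1]      # executions of each process per longest period
f = [1e-6] * 5           # hypothetical failure probability per request period

exact = exact_failure(n, f)
approx = linear_failure(n, f)
relative_error = (approx - exact) / approx
```

For failure probabilities this small the neglected higher order terms change the result by roughly one part in a hundred thousand, and the linear form always errs on the pessimistic side.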

4.2   Stepwise Refinement.

The code in a process can be considered as a sequential program. The
stepwise refinement techniques can now be applied to the construction of a
real-time process. Each process in the underlying real-time system can be
further decomposed into a set of segments in a structured way. Three basic
structures are used extensively for the refinement, i.e., sequence,
if-then-else, and while loop. Fig. 1 shows their basic flow structures.

Fig. 1   Basic structures for refinement (sequence, if-then-else, while loop)

Note that either □ or ○ represents a segment. Assume process $i$ contains
$N_i$ segments, each of which has failure probability $f_{ij}$,
$j = 1, 2, \ldots, N_i$. Let $e_{ij}$ be the execution probability associated
with segment $j$ in process $i$. ($e_{ij}$ represents the conditional
probability that segment $j$ will be executed during the execution of process
$i$.) The failure probability of process $i$, $f_i$, can then be approximated
as:

$$ f_i \approx \sum_{j=1}^{N_i} e_{ij} f_{ij} \qquad (2) $$

The formal proof can be found in [23] Appendix B, though the notation
used in [23] is different. Here we briefly sketch the proof by illustrating
a simple example of a real-time process shown in Fig. 2.

Fig. 2   Real-time process i - a simple example


In the example above, there are 3 possible paths between the beginning
and the end of the process (within one request period). Let $\pi_{mn}$ be the
probability that segment $n$ will be executed given the fact that segment $m$
has been executed. Each segment execution probability ($e_{ij}$) and path
execution probability can then be expressed in terms of the $\pi$'s. In
Fig. 2,

$$
\begin{aligned}
e_{i1} &= \pi_{81} = 1 \\
e_{i2} &= \pi_{12} e_{i1} = \pi_{12} \\
       &\;\;\vdots \\
e_{i8} &= \pi_{12}\pi_{23}\pi_{35}\pi_{57}\pi_{78} + \pi_{12}\pi_{23}\pi_{35}\pi_{58} + \pi_{12}\pi_{24}\pi_{46}\pi_{68} = 1
\end{aligned}
\qquad (3)
$$

and

$$
\begin{aligned}
\Pr(\text{path } 1) &= \pi_{12}\pi_{23}\pi_{35}\pi_{57}\pi_{78} \\
\Pr(\text{path } 2) &= \pi_{12}\pi_{23}\pi_{35}\pi_{58} \\
\Pr(\text{path } 3) &= \pi_{12}\pi_{24}\pi_{46}\pi_{68}
\end{aligned}
\qquad (4)
$$

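Let Pr(failure | path $j$) be the failure probability during the execution
of path $j$. For path 1 we have

$$
\begin{aligned}
\Pr(\text{failure} \mid \text{path } 1)
  &= f_{i1} + (1-f_{i1}) f_{i2} + (1-f_{i1})(1-f_{i2}) f_{i3} + \cdots \\
  &\quad + (1-f_{i1})(1-f_{i2})(1-f_{i3})(1-f_{i5})(1-f_{i7}) f_{i8} \\
  &= f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i7} + f_{i8} \pm \text{some higher order terms}
\end{aligned}
$$

A linear approximation results in

$$
\begin{aligned}
\Pr(\text{failure} \mid \text{path } 1) &\approx f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i7} + f_{i8} \\
\Pr(\text{failure} \mid \text{path } 2) &\approx f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i8} \\
\Pr(\text{failure} \mid \text{path } 3) &\approx f_{i1} + f_{i2} + f_{i4} + f_{i6} + f_{i8}
\end{aligned}
\qquad (5)
$$

Since

$$ f_i = \sum_{j=1}^{3} \Pr(\text{failure} \mid \text{path } j) \, \Pr(\text{path } j) \qquad (6) $$

substitute (4) and (5) into (6) and group the terms for each $f_{ij}$. If
we compare the coefficients of the $f_{ij}$'s with (3), we then have

$$ f_i \approx \sum_{j=1}^{N_i} e_{ij} f_{ij}, \qquad \text{where } N_i = 8. $$

Combining (1) and (2), in general, we have

$$ F \approx \sum_{i=1}^{M} n_i \sum_{j=1}^{N_i} e_{ij} f_{ij} $$

Let $g_{ij} = n_i e_{ij}$; we then obtain

$$ F \approx \sum_{i=1}^{M} \sum_{j=1}^{N_i} g_{ij} f_{ij} \qquad (7) $$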
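
The proof sketch can be replayed numerically on the three paths of the example. The Python fragment below is a sketch under assumptions: the segment graph is inferred from the paths listed in (4), the branch probabilities in `PI` and the segment failure probabilities are hypothetical, and the function names are introduced here.

```python
# Transition probabilities pi_mn between segments; branch values are hypothetical.
PI = {
    (1, 2): 1.0, (2, 3): 0.6, (2, 4): 0.4, (3, 5): 1.0, (4, 6): 1.0,
    (5, 7): 0.5, (5, 8): 0.5, (6, 8): 1.0, (7, 8): 1.0,
}


def paths(node=1, end=8):
    """Yield (segments, probability) for every path from `node` to `end`."""
    if node == end:
        yield [end], 1.0
        return
    for (m, n), p in PI.items():
        if m == node:
            for tail, q in paths(n, end):
                yield [node] + tail, p * q


def execution_probabilities():
    """e_ij: sum of the probabilities of all paths that contain segment j."""
    e = {}
    for segments, pr in paths():
        for s in segments:
            e[s] = e.get(s, 0.0) + pr
    return e


def process_failure_exact(f):
    """Equation (6) with the exact failure probability of each path."""
    total = 0.0
    for segments, pr in paths():
        survive = 1.0
        for s in segments:
            survive *= 1.0 - f[s]
        total += pr * (1.0 - survive)
    return total


def process_failure_linear(f):
    """Equation (2): sum of e_ij * f_ij."""
    e = execution_probabilities()
    return sum(e[s] * f[s] for s in e)


f = {s: 1e-5 for s in range(1, 9)}     # hypothetical segment failure probabilities
e = execution_probabilities()
exact = process_failure_exact(f)
linear = process_failure_linear(f)
```

With these figures segment 7 lies on only one path and receives weight 0.3, while segments 1, 2 and 8 lie on every path and receive weight 1, as (3) requires.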


Now consider that the recovery block scheme is applied to segment $j$ in
process $i$. Assume there are 2 alternates in the recovery block and the
failure probability of the primary or alternates is denoted by $f_{ijk}$
($k = 1, 2, 3$). Let $p_{ij}$ be the failure probability of the corresponding
acceptance test (i.e., the probability that the acceptance test itself
contains some logical errors). The flow of a recovery block suggests that
either the acceptance test failure or the exhaustion of the alternates would
cause a failure of the segment. So we have

$$
f_{ij} = p_{ij} + f_{ij1}(1-p_{ij})\,p_{ij} + f_{ij1} f_{ij2} (1-p_{ij})^2 p_{ij}
       + f_{ij1} f_{ij2} f_{ij3} (1-p_{ij})^3 \qquad (8)
$$

This is different from the "transition model for an application
routine" developed in [24], where transitions of a recovery block during some
arbitrary interval are considered. Formula (8) implies that $f_{ij}$ is
limited by the value of $p_{ij}$. If the acceptance test can be designed so
that $p_{ij}$ is very close to zero, then the failure probability of the
segment is approximated by the product of the failure probabilities of the
primary and alternates. Such an acceptance test can be obtained if the
inverse of the segment function exists or the results from the segment can be
clearly predicted. However, that kind of acceptance test is very hard to
achieve in many cases. Another approach is proposed to replace the recovery
block scheme.

Consider a two-out-of-three voting unit which has three alternates and
one voting component. The alternates are very similar to those in recovery
blocks except that they must perform exactly the same function (using
different algorithms). The voting component can be much simpler than the
acceptance test in recovery blocks. The logic of the voting component
compares the results produced by the three alternates and takes a majority
vote. The major difference between the recovery block scheme and the majority
voting unit is that once the acceptance test is passed, no further alternate
will be tried in the former scheme, but all alternates must be executed in
the latter case.

A  possible  syntax  for  the  majority  voting  unit  is  as  follows: 

majority  vote  from 

<alternate  1> 
and  <alternate  2> 
and     <alternate  3>; 

One possible implementation holds the results from each alternate in a
recovery cache. Global variables are updated only if any two alternates
produce agreeing results, as determined by comparing recovery caches (paying
careful attention to comparison of floating point variables).
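
A voting component along these lines might look as follows. This is a minimal sketch with assumed names (`agree`, `majority_vote`, and the three averaging routines are hypothetical); results are returned directly instead of being held in recovery caches, and floating point results are compared with a relative tolerance rather than for exact equality.

```python
import math


def agree(a, b, rel_tol=1e-9):
    """Two results agree if equal, or nearly equal when either is a float."""
    if isinstance(a, float) or isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def majority_vote(alternates, *args):
    """Run every alternate and return a result on which at least two agree."""
    results = []
    for alternate in alternates:
        try:
            results.append((True, alternate(*args)))
        except Exception:
            results.append((False, None))      # a crashed alternate cannot vote
    for i in range(len(results)):
        for j in range(i + 1, len(results)):
            ok_i, r_i = results[i]
            ok_j, r_j = results[j]
            if ok_i and ok_j and agree(r_i, r_j):
                return r_i
    raise RuntimeError("no two alternates agree")


def mean_running(xs):
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs)


def mean_fsum(xs):
    return math.fsum(xs) / len(xs)


def mean_faulty(xs):
    return sum(xs) / (len(xs) + 1)             # deliberate design error


voted = majority_vote([mean_faulty, mean_running, mean_fsum], [1.0, 2.0, 4.0])
```

Unlike the recovery block, every alternate is executed on each call, so the cost of the unit is the sum of the three execution times plus the comparison.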

Suppose this mechanism is applied to segment $j$ in process $i$. Since
the logic of the voting component is so simple, its failure probability is
assumed to be zero. Using the same notation employed in recovery blocks for
the failure probability of the alternates, we then have

$$
f_{ij} = f_{ij1} f_{ij2} f_{ij3} + (1-f_{ij1}) f_{ij2} f_{ij3}
       + (1-f_{ij2}) f_{ij1} f_{ij3} + (1-f_{ij3}) f_{ij1} f_{ij2} \qquad (9)
$$

By  comparing  (8)  and  (9),  we  can  conclude  that  if  the  acceptance  test 
of  a  recovery  block  cannot  be  designed  with  confidence  then  a  majority  voting 
unit  should  be  used,  provided  that  alternates  perform  the  same  functions. 
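
The comparison between (8) and (9) can be made concrete with a few numbers. In the sketch below the function names are introduced here and the probabilities (1e-3 for each routine, 1e-4 for a weak acceptance test) are hypothetical values chosen only to show the crossover.

```python
def recovery_block_failure(f1, f2, f3, p):
    """Equation (8): primary plus two alternates, acceptance test fails with probability p."""
    q = 1.0 - p
    return p + f1 * q * p + f1 * f2 * q ** 2 * p + f1 * f2 * f3 * q ** 3


def voting_unit_failure(f1, f2, f3):
    """Equation (9): two-out-of-three voting with a voter assumed never to fail."""
    return (f1 * f2 * f3
            + (1 - f1) * f2 * f3
            + (1 - f2) * f1 * f3
            + (1 - f3) * f1 * f2)


f = 1e-3                                                    # hypothetical routine failure probability
perfect_test = recovery_block_failure(f, f, f, p=0.0)       # about 1e-9
weak_test = recovery_block_failure(f, f, f, p=1e-4)         # dominated by p
voting = voting_unit_failure(f, f, f)                       # about 3e-6
```

With a perfect acceptance test the recovery block is the better scheme, but as soon as the test itself fails more often than about three times in a million the voting unit gives the lower segment failure probability for these figures.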


4.3   Analysis  of  the  Model. 

The model shows where the critical paths are. The larger the value of
$g_{ij}$ in (7), the more critical the software segment is. In other words, a
segment with high execution probability residing in a frequently executed
process should include more fault-tolerance.

A simple analysis of the effectiveness of the approach can be made by
exercising the model. The example real-time software system, which controls
the attitude of a satellite, is taken from [22]. The request period ($t_i$),
frequency count ($n_i$), number of segments ($N_i$), approximated segment
execution probabilities ($e_{ij}$'s) and $g_{ij}$'s for each process are
shown in Table 1. Each segment consists of no more than 10 Path Pascal
statements.


Process  Name                              t_i      n_i  N_i  e_ij       g_ij
-------  --------------------------------  -------  ---  ---  ---------  ----
1        Read Gyro                         125 ms   8    1    1.0        8.0
2        Inertial Wheel                    125 ms   8    1    1.0        8.0
3        Gyro Process                      0.5 sec  2    21   6 of 0.4   0.8
                                                              2 of 0.5   1.0
                                                              13 of 1.0  2.0
4        Read Sensors                      1 sec    1    1    1.0        1.0
5        Attitude Determination & Control  1 sec    1    33   6 of 0.25  0.25
                                                              13 of 0.5  0.5
                                                              14 of 1.0  1.0

Table 1.
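
The last column of Table 1 follows from $g_{ij} = n_i e_{ij}$ and can be recomputed directly. The sketch below encodes the table as a Python list; the function names are introduced here, and the final step adds an assumption that is not part of the model, namely that every segment has the same failure probability, so that a system target can be turned into a rough per-segment budget.

```python
# (process name, n_i, [(number of segments, e_ij), ...]) transcribed from Table 1.
TABLE_1 = [
    ("Read Gyro", 8, [(1, 1.0)]),
    ("Inertial Wheel", 8, [(1, 1.0)]),
    ("Gyro Process", 2, [(6, 0.4), (2, 0.5), (13, 1.0)]),
    ("Read Sensors", 1, [(1, 1.0)]),
    ("Attitude Determination & Control", 1, [(6, 0.25), (13, 0.5), (14, 1.0)]),
]


def weights(table):
    """g_ij = n_i * e_ij for each group of segments in each process."""
    return {name: [n * e for _, e in groups] for name, n, groups in table}


def total_weight(table):
    """Sum of g_ij over all segments, i.e. F / f when all f_ij equal f."""
    return sum(n * count * e for _, n, groups in table for count, e in groups)


def uniform_segment_budget(target, table):
    """Per-segment failure probability that meets `target` if all segments are alike."""
    return target / total_weight(table)


g = weights(TABLE_1)
total = total_weight(TABLE_1)
budget = uniform_segment_budget(1e-6, TABLE_1)   # 1e-6 is an example target
```

The weights sum to 71.8, so under the equal-reliability assumption a system failure probability of 1e-6 per one-second period corresponds to a segment failure probability of about 1.4e-8.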


Without loss of generality, indices of the segments are assumed to be
in increasing order of the values of the $e_{ij}$'s. Substituting the figures
in Table 1 into (7) we have

$$
\begin{aligned}
F \approx{} & 8 f_{11} + 8 f_{21}
   + 0.8 \sum_{j=1}^{6} f_{3j} + \sum_{j=7}^{8} f_{3j} + 2 \sum_{j=9}^{21} f_{3j}
   + f_{41} \\
 & + 0.25 \sum_{j=1}^{6} f_{5j} + 0.5 \sum_{j=7}^{19} f_{5j} + \sum_{j=20}^{33} f_{5j}
\end{aligned}
$$

If the system failure probability is required to be $10^{-6}$ (which is
approximately equivalent to MTBF = 11 days), then each $f_{ij}$ needs to be
in the order of $10^{-7}$ or $10^{-8}$. By applying recovery blocks and the
two-out-of-three majority voting units properly to the segments, this can be
reached with primaries or alternates whose failure probabilities are in the
order of $10^{-3}$ or $10^{-4}$.


5   Conclusions.

Fault-tolerance complements the fault-avoidance approach and further
improves system reliability. In this paper, the construction of a
fault-tolerant real-time software system by top-down modular design and
stepwise refinement techniques is presented. Various programming tools for
fault-tolerance in real-time systems are discussed. A simplified mathematical
model to estimate the effectiveness of the approach is also described. Though
accurate evaluation of the reliability of a software component is difficult,
the model supports the application of modular design and stepwise refinement
to fault-tolerance in construction of reliable real-time software. The model
suggests that large, highly reliable systems can be built from segments of
software that are as reliable as current software engineering primitives
(testing, verification, etc.) permit.